Abstract

Categorization axioms have been proposed to axiomatize clustering results, which offers a hint for bridging the gap between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. However, categorization axioms cannot be generalized to a general machine learning system, because they become trivial when the number of categories is one. In order to generalize categorization axioms to general cases, categorization input and categorization output are reinterpreted by inner and outer category representation. According to this reinterpretation, two category representation axioms are presented. Category representation axioms and categorization axioms can be combined into a generalized categorization axiomatic framework, which accurately delimits the theoretical categorization constraints and overcomes the shortcoming of categorization axioms. The proposed axiomatic framework not only discusses the categorization test issue but also reinterprets many results in machine learning in a unified way, such as dimensionality reduction, density estimation, regression, clustering and classification.
1 Introduction

Up to now, many elegant but complex machine learning theories have been developed for categorization, such as PAC theory (Valiant, 1984), statistical learning theory (Vapnik, 2000) and so on. However, a six or seven year old child can easily and correctly categorize many objects without understanding any of the above-mentioned machine learning theories. Therefore, there exists a clear gap between the human recognition system and machine learning theories.

In Yu and Xu (2014), categorization axioms were proposed to axiomatize clustering results, which theoretically offers a hint for bridging the gap between the human recognition system and machine learning through an intuitive observation: an object should be assigned to its most similar category. Assuming that c > 1 and that the object representation of the input is the same as that of the output, Yu and Xu (2014) proposed a representation of clustering results and studied clustering results based on categorization axioms. However, the representation proposed for clustering results in (Yu and Xu, 2014) may not be applicable to many machine learning algorithms. For example, when the number of categories becomes one, categorization axioms become trivial as they are always true. In the literature, many learning algorithms, such as manifold learning and regression, belong to the one-category learning problem. In order to generalize categorization axioms, categorization needs to be investigated further.

According to the above analysis, several improvements on categorization axioms are made in this paper as follows:

1) A unified categorization representation is put forward, and the similarity operator and the assignment operator are defined.

2) Category representation is axiomatized by two axioms: the existence axiom of category representation and the uniqueness axiom of category representation.

3) Three principles of developing categorization methods are investigated under the newly proposed categorization representation.

4) Categorization test is discussed through the categorization test axiom and the categorization robustness assumption.

5) Density estimation, regression, classification, clustering and dimensionality reduction are reinterpreted by the proposed axioms.

The remainder of the paper is organized as follows. In section 2, a unified categorization representation is discussed and two axioms of category representation are presented. In section 3, three categorization axioms are reinterpreted under the new categorization representation. In section 4, how to theoretically evaluate a categorization algorithm is discussed. In section 5, how to design a categorization method is discussed. In section 6, as applications of the proposed categorization axiomatic framework, dimensionality reduction, density estimation, regression, clustering and classification are reinterpreted in a unified way. The final section offers concluding remarks.


2 Category Representation Axioms 


In cognitive science, a basic principle of the human recognition system is that an object should be assigned to its most similar category. For human beings, membership explicitly represents that an object is assigned to some category and must be observable by others, whereas the similarity between an object and a category may be implicit and may not be observable by others. In other words, human beings have two category representations for categorization: membership is explicit and is called the outer category representation, while similarity may be implicit and belongs to the inner category representation. According to cognitive science, the inner category representation of a category is in the mind of human beings and may differ from the outer category representation. Human beings establish the relation between objects in the world and the corresponding concepts in the mind through these two category representations. A categorization algorithm should likewise have inner and outer category representations in order to reflect the relation between objects in the world and the corresponding categories, as Yu and Xu (2014) have done for clustering results. Considering the limits of the representation proposed in (Yu and Xu, 2014), we reinterpret below how to define the inner and outer category representations in a categorization algorithm.


Any algorithm has an input and an output. For a categorization algorithm, the input is called the categorization input and the output is called the categorization result (or categorization output). The categorization input should have an inner and an outer representation: the inner categorization input is what is expected to be learned with respect to the outer categorization input. Similarly, the categorization output should have an inner and an outer representation: the inner categorization output is what is actually learned with respect to the outer categorization output.

The outer categorization input is the predefined categorization information of the sampling objects $O = \{o_1, o_2, \cdots, o_n\}$, including the input object representation and the corresponding outer category representation.

The input object representation is denoted by $X = \{x_1, x_2, \cdots, x_n\}$ with $c$ subsets $X_1, X_2, \cdots, X_c$, where $x_k$ represents the $k$th object $o_k$ and $X_i$ is the set of all objects of the $i$th category in the dataset $X$. The outer category representation of the categorization input can be represented by $U = [u_{ik}]_{c \times n}$, where $\forall i \forall k$, $u_{ik} \geq 0$ represents the membership of the object $x_k$ in the $i$th category. Hence, the outer categorization input can be represented by $(X, U)$. More details can be found in (Yu and Xu, 2014). When $U$ is known, an object should be assigned to the category with the largest membership. Therefore, the assignment (outer referring) operator $\rightarrow$ can be defined as $\vec{X} = \{\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_n\}$, where $\vec{x}_k = \arg\max_i u_{ik}$.
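The assignment operator can be illustrated with a minimal Python sketch. It assumes that the membership matrix is stored as a list of $c$ rows and that ties are kept as a set of categories; the function name `assignment_operator` and the toy matrix are hypothetical and serve only as an illustration.

```python
from typing import List, Set


def assignment_operator(U: List[List[float]]) -> List[Set[int]]:
    """For each object k, return the set of categories i maximising u_ik.

    U is a c x n membership matrix stored as a list of c rows.
    Categories are numbered from 1 to match the text.
    """
    c, n = len(U), len(U[0])
    assigned = []
    for k in range(n):
        column = [U[i][k] for i in range(c)]
        best = max(column)
        # Keep every maximiser, so a tie yields a set with several categories.
        assigned.append({i + 1 for i in range(c) if column[i] == best})
    return assigned


# Toy example (illustrative values): 2 categories, 3 objects.
U = [[0.9, 0.2, 0.5],
     [0.1, 0.8, 0.5]]
print(assignment_operator(U))  # [{1}, {2}, {1, 2}]
```

The third object has equal membership in both categories, so its assignment is not single-valued.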


Similarly, the outer categorization result can be expressed by $(Y, V)$, where $Y = \{y_1, y_2, \cdots, y_n\}$ is the object representation of the output, $y_k$ also represents the $k$th object $o_k$, and $Y_1, Y_2, \cdots, Y_c$ correspond to the $c$ input subsets $X_1, X_2, \cdots, X_c$. $V$ is the outer category representation of the output: $V = [v_{ik}]_{c \times n} = [v_1, v_2, \cdots, v_n]$ is a partition matrix, where $\forall i \forall k$, $v_{ik} \geq 0$ represents the membership of the object $y_k$ in the $i$th category and $v_k = [v_{1k}, v_{2k}, \cdots, v_{ck}]^T$. Similarly, the assignment operator $\rightarrow$ is defined as $\vec{Y} = \{\vec{y}_1, \vec{y}_2, \cdots, \vec{y}_n\}$, where $\vec{y}_k = \arg\max_i v_{ik}$. If $\vec{x}_k$ and $\vec{y}_k$ are single-valued, $x_k$ belongs to the $\vec{x}_k$th category and $y_k$ belongs to the $\vec{y}_k$th category. In common sense, the assignment operator $\rightarrow$ represents outer referring and reflects the external relation between the object and the category.


As pointed out by Yu and Xu (2014), the cognitive representation of a category is always supposed to exist, even in an implicit state, when designing a categorization algorithm. For simplicity, when the input $X = \{x_1, x_2, \cdots, x_n\}$ is categorized into $c$ subsets $X_1, X_2, \cdots, X_c$, $\forall i$, $\underline{X_i}$ is supposed to be the cognitive representation of the $i$th category; when the output $Y = \{y_1, y_2, \cdots, y_n\}$ is categorized into $c$ subsets $Y_1, Y_2, \cdots, Y_c$, $\forall i$, $\underline{Y_i}$ is supposed to be the cognitive representation of the $i$th category.


As pointed out by Yu and Xu (2014), when the cognitive representation of every category is defined, objects can be categorized based on the similarity between objects and categories. As the input is usually different from the output, the input category similarity mapping and the output category similarity mapping can be defined by computing the similarity between objects and categories as follows.

Input Category Similarity Mapping:

$Sim_X : X \times \{\underline{X_1}, \underline{X_2}, \cdots, \underline{X_c}\} \mapsto R_+$ is called a category similarity mapping if an increase in $Sim_X(x_k, \underline{X_i})$ indicates greater similarity between $x_k$ and $\underline{X_i}$, and a decrease in $Sim_X(x_k, \underline{X_i})$ indicates less similarity between $x_k$ and $\underline{X_i}$.

Output Category Similarity Mapping:

$Sim_Y : Y \times \{\underline{Y_1}, \underline{Y_2}, \cdots, \underline{Y_c}\} \mapsto R_+$ is called a category similarity mapping if an increase in $Sim_Y(y_k, \underline{Y_i})$ indicates greater similarity between $y_k$ and $\underline{Y_i}$, and a decrease in $Sim_Y(y_k, \underline{Y_i})$ indicates less similarity between $y_k$ and $\underline{Y_i}$.

For the input category similarity mapping, the similarity (inner referring) operator $\sim$ can be defined as $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_n\}$, where $\tilde{x}_k = \arg\max_i Sim_X(x_k, \underline{X_i})$. Similarly, for the output category similarity mapping, the similarity operator $\sim$ can be defined as $\tilde{Y} = \{\tilde{y}_1, \tilde{y}_2, \cdots, \tilde{y}_n\}$, where $\tilde{y}_k = \arg\max_i Sim_Y(y_k, \underline{Y_i})$. If $\tilde{y}_k$ is single-valued, the larger $Sim_Y(y_k, \underline{Y_{\tilde{y}_k}})$ is, the better $Sim_Y$ is. Similarly, if $\tilde{x}_k$ is single-valued, the larger $Sim_X(x_k, \underline{X_{\tilde{x}_k}})$ is, the better $Sim_X$ is.

Similarly, the input category dissimilarity mapping and the output category dissimilarity mapping can be defined as follows:

Input Category Dissimilarity Mapping:

$Ds_X : X \times \{\underline{X_1}, \underline{X_2}, \cdots, \underline{X_c}\} \mapsto R_+$ is called a category dissimilarity mapping if an increase in $Ds_X(x_k, \underline{X_i})$ indicates less similarity between $x_k$ and $\underline{X_i}$, and a decrease in $Ds_X(x_k, \underline{X_i})$ indicates greater similarity between $x_k$ and $\underline{X_i}$.

Output Category Dissimilarity Mapping:

$Ds_Y : Y \times \{\underline{Y_1}, \underline{Y_2}, \cdots, \underline{Y_c}\} \mapsto R_+$ is called a category dissimilarity mapping if an increase in $Ds_Y(y_k, \underline{Y_i})$ indicates less similarity between $y_k$ and $\underline{Y_i}$, and a decrease in $Ds_Y(y_k, \underline{Y_i})$ indicates greater similarity between $y_k$ and $\underline{Y_i}$.

For the input category dissimilarity mapping, the similarity operator $\sim$ can be defined as $\tilde{X} = \{\tilde{x}_1, \tilde{x}_2, \cdots, \tilde{x}_n\}$, where $\tilde{x}_k = \arg\min_i Ds_X(x_k, \underline{X_i})$. Similarly, for the output category dissimilarity mapping, the similarity operator $\sim$ can be defined as $\tilde{Y} = \{\tilde{y}_1, \tilde{y}_2, \cdots, \tilde{y}_n\}$, where $\tilde{y}_k = \arg\min_i Ds_Y(y_k, \underline{Y_i})$. If $\tilde{y}_k$ is single-valued, the smaller $Ds_Y(y_k, \underline{Y_{\tilde{y}_k}})$ is, the better $Ds_Y$ is. Similarly, if $\tilde{x}_k$ is single-valued, the smaller $Ds_X(x_k, \underline{X_{\tilde{x}_k}})$ is, the better $Ds_X$ is. If $\tilde{x}_k$ and $\tilde{y}_k$ are single-valued, $x_k$ is said to be similar to the $\tilde{x}_k$th category and $y_k$ is said to be similar to the $\tilde{y}_k$th category. In daily life, the similarity operator $\sim$ represents inner referring and establishes the latent relation between the object in the world and the cognitive category representation.

Footnote 1: In order to be consistent with intuition, the category similarity mapping and the category dissimilarity mapping are usually supposed to be non-negative in this section. In applications, they can be negative.
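The similarity operator based on a dissimilarity mapping can be sketched in Python as follows. The sketch assumes, purely for illustration, that each cognitive representation is a single prototype point and that the dissimilarity is the squared Euclidean distance; neither choice is prescribed by the text, and the names `squared_distance` and `similarity_operator` are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Point = Sequence[float]


def squared_distance(x: Point, prototype: Point) -> float:
    """Hypothetical dissimilarity Ds(x, X_i): squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, prototype))


def similarity_operator(
    X: List[Point],
    prototypes: List[Point],
    ds: Callable[[Point, Point], float] = squared_distance,
) -> List[Set[int]]:
    """For each x_k, return the set of categories minimising Ds(x_k, X_i).

    Categories are numbered from 1; ties are kept as a set.
    """
    result = []
    for x in X:
        values = [ds(x, p) for p in prototypes]
        best = min(values)
        result.append({i + 1 for i, v in enumerate(values) if v == best})
    return result


# Toy example (illustrative values): two prototypes on a line.
X = [(0.0, 0.0), (4.0, 0.0), (2.0, 0.0)]
prototypes = [(0.0, 0.0), (4.0, 0.0)]
print(similarity_operator(X, prototypes))  # [{1}, {2}, {1, 2}]
```

With a similarity mapping instead of a dissimilarity mapping, the only change would be to replace `min` by `max`.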

According to the above analysis, when the outer categorization input is $(X, U)$, its corresponding inner categorization input can be represented by $(\underline{X}, Sim_X)$ or by $(\underline{X}, Ds_X)$, where $\underline{X} = \{\underline{X_1}, \underline{X_2}, \cdots, \underline{X_c}\}$. For brevity, $(X, U, \underline{X}, Sim_X)$ or $(X, U, \underline{X}, Ds_X)$ is called the categorization input. $(\underline{X}, Sim_X)$ or $(\underline{X}, Ds_X)$ is the inner category representation of the input, called the inner input for short.

Likewise, when the outer categorization result is $(Y, V)$, its corresponding inner categorization result can be represented by $(\underline{Y}, Sim_Y)$ or by $(\underline{Y}, Ds_Y)$, where $\underline{Y} = \{\underline{Y_1}, \underline{Y_2}, \cdots, \underline{Y_c}\}$. For brevity, $(Y, V, \underline{Y}, Sim_Y)$ or $(Y, V, \underline{Y}, Ds_Y)$ is called the categorization result. $(\underline{Y}, Sim_Y)$ or $(\underline{Y}, Ds_Y)$ is the inner category representation of the output, called the inner output for short. If a categorization algorithm can explicitly output $\underline{Y}$, it is called a white box. If a categorization algorithm cannot explicitly output $\underline{Y}$ but only explicitly outputs $(Y, V)$, it is called a black box. If a categorization algorithm can explicitly output part but not all of $\underline{Y}$, it is called a grey box.

For a categorization algorithm, the outer input and the outer output should have corresponding inner category representations. We call this requirement the Existence Axiom of Category Representation (ECR). More precisely, it can be expressed as follows:

1) ECR:

For a categorization algorithm, if its outer input is $(X, U)$ and its outer output is $(Y, V)$, then there exist the corresponding inner input $(\underline{X}, Sim_X)$ and inner output $(\underline{Y}, Sim_Y)$.

For a categorization algorithm, the input is expected to have the same category representation as the output with respect to categorization. $(\underline{X}, Sim_X)$ and the corresponding output $(\underline{Y}, Sim_Y)$ are considered to have the same category representation with respect to categorization if $(\underline{X}, \tilde{X}) = (\underline{Y}, \tilde{Y})$. $(X, U)$ and $(Y, V)$ are considered to have the same category representation with respect to categorization if $\vec{X} = \vec{Y}$. This assumption is called the Uniqueness Axiom of Category Representation (UCR), which can be expressed as follows:

2) UCR:

For a categorization algorithm, its categorization input $(X, U, \underline{X}, Sim_X)$ and its corresponding categorization output $(Y, V, \underline{Y}, Sim_Y)$ should satisfy $(\underline{X}, \vec{X}, \tilde{X}) = (\underline{Y}, \vec{Y}, \tilde{Y})$.

ECR and UCR are called category representation axioms. $(X, U, \underline{X}, Sim_X)$ represents the category information given by the outer information provider, and $(Y, V, \underline{Y}, Sim_Y)$ represents the category information produced by the categorization algorithm. $(\underline{X}, Sim_X)$ is expected to be learned and represents the inner category representation of the outer information provider, while $(\underline{Y}, Sim_Y)$ is actually learned and represents the inner category representation of the categorization algorithm. UCR gives the conditions under which learning can be perfectly accomplished: the categorization input and the categorization output have the same categorization semantics. Sometimes, $\vec{X} = \vec{Y}$ can be further strengthened into $U = V$.


3 Reinterpretation of Categorization 
Axioms 


According to Yu and Xu (2014), the categorization axioms include the Sample Separation Axiom (SS), the Category Separation Axiom (CS) and the Categorization Equivalency Axiom (CE). For a categorization result $(Y, V, \underline{Y}, Sim_Y)$, SS, CS and CE can be reinterpreted by the similarity operator and the assignment operator as follows.

1) SS: $\forall k \exists i (\tilde{y}_k = i)$

2) CS: $\forall i \exists k (\tilde{y}_k = i)$

3) CE: $\vec{Y} = \tilde{Y}$

Moreover, we can prove Theorem 1.

Theorem 1. If $\forall k \forall i \forall j ((j \neq i) \rightarrow (Sim_Y(y_k, \underline{Y_i}) \neq Sim_Y(y_k, \underline{Y_j})))$, then SS must hold.
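The three axioms can be checked mechanically once the operators have been evaluated. The following Python sketch assumes that $\tilde{y}_k$ and $\vec{y}_k$ are stored as sets of category indices (so that ties are visible) and reads "$\tilde{y}_k = i$" as "$\tilde{y}_k$ is the single category $i$"; this encoding and the function names are assumptions of the sketch, not part of the axioms.

```python
from typing import List, Set


def sample_separation(tilde_y: List[Set[int]]) -> bool:
    """SS: every object is most similar to exactly one category."""
    return all(len(s) == 1 for s in tilde_y)


def category_separation(tilde_y: List[Set[int]], c: int) -> bool:
    """CS: every category 1..c has at least one object uniquely similar to it."""
    return all(any(s == {i} for s in tilde_y) for i in range(1, c + 1))


def categorization_equivalency(vec_y: List[Set[int]], tilde_y: List[Set[int]]) -> bool:
    """CE: the assignment operator and the similarity operator agree on every object."""
    return vec_y == tilde_y


# Toy example (illustrative values): the third object is tied between categories 1 and 2.
tilde_y = [{1}, {2}, {1, 2}]
vec_y = [{1}, {2}, {1}]
print(sample_separation(tilde_y))                   # False
print(category_separation(tilde_y, c=2))            # True
print(categorization_equivalency(vec_y, tilde_y))   # False
```

Note that for $c = 1$ every $\tilde{y}_k$ equals $\{1\}$, so all three checks succeed trivially, which matches the observation that the axioms become trivial for one category.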


When a categorization result is not proper, some objects theoretically belong to two or more categories. In other words, some objects lie on the borderline of some category. Based on this fact, the boundary set can be defined as follows.

Boundary set: For a categorization result $(Y, V, \underline{Y}, Sim_Y)$, the boundary set for $(Y, \underline{Y}, Sim_Y)$ is defined as

$BS_{(Y, \underline{Y}, Sim_Y)} = \{y_k \mid card(\tilde{y}_k) > 1\}$

where $card(\tilde{y}_k)$ represents the cardinality of the set $\tilde{y}_k$.
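As a small illustration, the boundary set can be computed directly from the values of the similarity operator. The sketch assumes that each $\tilde{y}_k$ is stored as a set of category indices; the function name `boundary_set` and the object labels are hypothetical.

```python
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")


def boundary_set(Y: Sequence[T], tilde_y: List[Set[int]]) -> List[T]:
    """Return the objects y_k whose similarity operator value has cardinality > 1."""
    return [y for y, s in zip(Y, tilde_y) if len(s) > 1]


# Toy example (illustrative values): only the third object is tied.
Y = ["y1", "y2", "y3"]
tilde_y = [{1}, {2}, {1, 2}]
print(boundary_set(Y, tilde_y))  # ['y3']
```

An empty boundary set is equivalent to SS holding under this encoding.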

Clearly, the above analysis also holds for the categorization input $(X, U, \underline{X}, Sim_X)$. Therefore, $(X, U, \underline{X}, Sim_X)$ should also satisfy SS, CS and CE. For brevity, we do not repeat the analogous results. More interestingly, a relation can be established between UCR and CE by Theorem 2.

Theorem 2. If the categorization input $(X, U, \underline{X}, Sim_X)$ and the categorization result $(Y, V, \underline{Y}, Sim_Y)$ satisfy CE, then $\vec{X} = \vec{Y}$ is equivalent to $\tilde{X} = \tilde{Y}$.

As noted above, the input $x_k$ and the corresponding $y_k$ represent the same object $o_k$. Generally speaking, the input $x$ and the corresponding output $y$ represent the same object $o$; it is therefore natural to assume that there exists a mapping $\theta$ from $x$ to $y$, i.e. $y = \theta(x)$. When $\underline{X} = \underline{Y}$, it is easy to see that $Sim_Y(y_k, \underline{Y_i}) = Sim_Y(\theta(x_k), \underline{Y_i}) = Sim_Y(\theta(x_k), \underline{X_i})$. Hence, $Sim_X(x_k, \underline{X_i})$ can be defined by $Sim_Y(\theta(x_k), \underline{X_i})$. Therefore, $\underline{X} = \underline{Y}$ implies $\tilde{X} = \tilde{Y}$ when $Sim_X(x_k, \underline{X_i})$ is defined by $Sim_Y(\theta(x_k), \underline{X_i})$. By Theorem 2 and the above analysis, $\underline{X} = \underline{Y}$ plays an essential role in UCR. In particular, when $c = 1$, $\vec{X} = \vec{Y}$ and $\tilde{X} = \tilde{Y}$ hold trivially, so $\underline{X} = \underline{Y}$ is the only meaningful requirement in UCR. Moreover, categorization axioms and UCR give the conditions that a category similarity mapping should satisfy, and state that the input category similarity mapping should be equivalent to the output category similarity mapping with respect to categorization, which is called the similarity assumption. For categorization, it is very challenging to design a proper output category similarity mapping satisfying UCR and the categorization axioms. In practice, the input category similarity mapping is usually not equivalent to the output category similarity mapping with respect to categorization, which is called the similarity paradox. If the similarity paradox occurs, the categorization error will not be zero. According to the above analysis, the key to solving the similarity paradox is to keep $\underline{X} = \underline{Y}$ true. As a matter of fact, it is often the case that $\underline{X} \neq \underline{Y}$. Therefore, how to solve the similarity paradox is an eternal problem in categorization.

In summary, category representation axioms and categorization axioms establish the relationships among all the parts related to the categorization input and the categorization output, as shown in Figure 1. UCR establishes the categorization equivalence between the input and the corresponding output. Categorization axioms only establish the relationships between the outer representation and the corresponding inner representation and do not reflect the relation between the input and the output. If the object representation can be theoretically generated by the corresponding cognitive representation, then the cognitive representation is called generative. If the object representation cannot be theoretically generated by the corresponding cognitive representation but can decide the corresponding cognitive representation, then the cognitive representation is called discriminative. If the cognitive representation is generative, the corresponding learning model is called a generative model. If the cognitive representation is discriminative, the corresponding learning model is called a discriminative model.


In particular, let $X = Y$ and let UCR be true; then $(\underline{X}, Sim_X)$ and $(\underline{Y}, Sim_Y)$ are exchangeable with respect to categorization. Under such assumptions, $(X, U, \underline{X}, Sim_X)$ can be used to represent the categorization result, where $(\underline{X}, Sim_X)$ actually denotes $(\underline{Y}, Sim_Y)$. In Yu and Xu (2014), ECR and UCR are implicitly assumed to be true. Such an assumption makes CE not true, as it is very difficult for $(\underline{Y}, Sim_Y)$ to have the same categorization capacity as $(X, U)$ in practice, especially when $U$ is given a priori.


4 Categorization Test 

None of the above analysis discusses how to evaluate the categorization result $(Y, V, \underline{Y}, Sim_Y)$. Frankly speaking, it is very challenging to test the performance of a categorization algorithm. When estimating the categorization performance, a test set $(X_T, U_T)$ is usually provided, and $(X, U)$ is called the training set. According to the analysis in section 2, $(\underline{X_T}, Sim_{X_T})$ exists. Similarly, if $(X_T, U_T, \underline{X_T}, Sim_{X_T})$ is used as the categorization input, the corresponding categorization output can be represented by $(Y_T, V_T, \underline{Y_T}, Sim_{Y_T})$.

The test set and the training set are supposed to represent the same categorization for the same categorization algorithm. Therefore, the Categorization Test Axiom can be expressed as follows:

Categorization Test Axiom: For a categorization algorithm, if its training set is $(X, U)$ and its test set is $(X_T, U_T)$, then $(\underline{X}, Sim_X) = (\underline{X_T}, Sim_{X_T})$.

The categorization test axiom gives the prerequisite condition for a categorization algorithm to have generalization ability, which is a demanding requirement for categorization. It is easy to prove that the categorization test axiom implies that the objects in the training set and the test set should be independent and identically distributed if objects are random variables.

Usually, $\underline{X}$ only approximates $\underline{X_T}$. Sometimes, the difference between $\underline{X}$ and $\underline{X_T}$ is so big that they cannot be considered to represent the same categorization. In this case, the test result is not credible, and it cannot be checked whether the corresponding categorization algorithm has generalization ability or not.

In fact, $\underline{X}$ and $\underline{X_T}$ are unobservable and unknown, so it is very difficult to measure the difference between them. Instead of measuring the difference between $\underline{X}$ and $\underline{X_T}$, one estimation method is to compute the difference between $(X, U)$ and $(X_T, U_T)$; the other is to compute the difference between $\underline{Y}$ and $\underline{Y_T}$, assuming that UCR holds at least approximately. Theoretically, the difference between $\underline{X}$ and $\underline{X_T}$ should be proportional to the difference between $\underline{Y}$ and $\underline{Y_T}$ in the ideal case. Therefore, the categorization robustness assumption can be described as follows:

Categorization Robustness Assumption: A categorization algorithm is called robust if there exist two constants $k_1$ and $k_2$ such that $k_1 |\underline{Y} - \underline{Y_T}| \leq |\underline{X} - \underline{X_T}| \leq k_2 |\underline{Y} - \underline{Y_T}|$, where $0 < k_1 \leq k_2$.

The categorization robustness assumption gives the global condition under which the corresponding categorization algorithm has generalization ability when the categorization test axiom does not hold. If the categorization test axiom holds, a good categorization algorithm should make $|\underline{Y} - \underline{Y_T}|$ as small as possible. When the categorization test axiom does not hold, it is very challenging to check whether or not the categorization robustness assumption holds, as $\underline{X}$ and $\underline{X_T}$ are usually not known. Therefore, a substitute method is to compute the distance between the outer representations. This idea leads to the local categorization robustness assumption as follows:

Local Categorization Robustness Assumption: A categorization algorithm is called locally robust if there exist two constants $k_1$ and $k_2$ such that $k_1 |(Y, V) - (Y_T, V_T)| \leq |(X, U) - (X_T, U_T)| \leq k_2 |(Y, V) - (Y_T, V_T)|$, where $0 < k_1 \leq k_2$, $(X, U)$ is a training set and $(X_T, U_T)$ is a test set.

Clearly, if the local categorization robustness assumption is satisfied with respect to $|(X, U) - (X_T, U_T)| < \epsilon$, where $\epsilon$ is a very small positive number, the corresponding algorithm can be stably evaluated in theory.


5 Design Principles of Categorization
Methods


When categorization axioms were proposed by Yu and Xu (2014), three design principles of clustering methods were also proposed. However, these three design principles need to be reinterpreted when categorization is investigated. Since five axioms (ECR, UCR, SS, CS and CE) have been proposed for categorization algorithms, it is natural to expect that they are also useful for developing categorization methods. Clearly, the five axioms do not have equal importance when designing a categorization method. ECR only tells us how to represent the categorization input and the categorization output. CE is always supposed to be true for a categorization algorithm, since the outer referring and the corresponding inner referring should represent the same referring; in a word, the explicit function of a categorization algorithm should be the same as its internally implemented function. As pointed out by Yu and Xu (2014), SS and CS offer a very low bar for clustering results. Similarly, SS and CS are also loose requirements for categorization. UCR is far more demanding, as it requires three equivalence conditions to be true simultaneously. Therefore, three design principles of categorization methods can be inferred from SS, CS and UCR. In the following, we investigate these three principles in turn under the proposed axiomatic framework.

Figure 1: Relationship between a categorization input $(X, U, \underline{X}, Sim_X)$ and its corresponding categorization result $(Y, V, \underline{Y}, Sim_Y)$


5.1 Category Compactness Principle

Theorem 1 shows that SS imposes almost no requirement, as the conditions of Theorem 1 are often true in the general case for a well designed category similarity. Following the same analysis as in Yu and Xu (2014), SS should be strengthened into the category compactness principle as follows:

Category Compactness Principle: A categorization method should make its categorization result as compact as possible.

The category compactness principle says that every category should be as compact as possible. Under the proposed representation of the categorization result, the category compactness criterion can be defined as follows.

Category Compactness Criterion: $J_C : \{Y, V\} \times \{\underline{Y}, Ds_Y\} \mapsto R_+$ is called a category compactness criterion if the optimum of $J_C(Y, V, \underline{Y}, Ds_Y)$ corresponds to the categorization result with the largest category compactness.

According to the categorization axioms, the category compactness criterion can be equivalently defined by $J_C(X, U, \underline{X}, Ds_X)$. In the literature, it is often seen that $J_C(X, U, \underline{X}, Ds_X) = \sum_i \sum_k u_{ik} Ds_X(x_k, \underline{X_i})$.

Because of the relations among the components of $(X, U, \underline{X}, Ds_X)$, $J_C(X, U, \underline{X}, Ds_X)$ can be further simplified into $J_C(X, \underline{X}, Ds_X)$ or $J_C(U)$. Given the definition of the category similarity mapping, the category compactness principle is still applicable to categorization when $c = 1$.

5.2 Category Separation Principle

If a categorization result $(Y, V, \underline{Y}, Sim_Y)$ satisfies CS, then $\forall 1 \leq i \neq j \leq c$, $\underline{Y_i} \neq \underline{Y_j}$. For the same reason as in Yu and Xu (2014), CS can be strengthened into the category separation principle as follows:

Category Separation Principle: A good categorization result should have the maximum distance between categories.

Under the proposed representation of the categorization result, the category separation criterion can be defined as follows.

Category Separation Criterion: $J_S : \{Y, V\} \times \{\underline{Y_1}, \underline{Y_2}, \cdots, \underline{Y_c}\} \mapsto R_+$ is called a category separation criterion if the optimum of $J_S(Y, V, \{\underline{Y_1}, \underline{Y_2}, \cdots, \underline{Y_c}\})$ corresponds to the categorization result with maximal category separation.

The category separation principle requires that $c > 1$. In other words, when $c = 1$, the category separation principle is unavailable.
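The two criteria can be made concrete with a short Python sketch. The compactness function implements the weighted sum of dissimilarities $\sum_i \sum_k u_{ik} Ds_X(x_k, \underline{X_i})$ quoted in section 5.1. The separation function is only one possible choice, assumed here for illustration: the smallest dissimilarity between two prototypes. The prototype representation, the squared Euclidean dissimilarity and all function names are assumptions of the sketch.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Hypothetical dissimilarity: squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def compactness(X: List[Point], U: List[List[float]], prototypes: List[Point]) -> float:
    """J_C = sum_i sum_k u_ik * Ds(x_k, X_i); smaller means more compact."""
    return sum(
        U[i][k] * squared_distance(X[k], prototypes[i])
        for i in range(len(prototypes))
        for k in range(len(X))
    )


def separation(prototypes: List[Point]) -> float:
    """Assumed J_S: smallest dissimilarity between two distinct prototypes."""
    c = len(prototypes)
    if c < 2:
        # Mirrors the text: the separation principle is unavailable when c = 1.
        raise ValueError("category separation requires c > 1")
    return min(
        squared_distance(prototypes[i], prototypes[j])
        for i in range(c)
        for j in range(i + 1, c)
    )


# Toy example (illustrative values): two well separated groups on a line.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
U = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
prototypes = [(0.5, 0.0), (4.5, 0.0)]
print(compactness(X, U, prototypes))  # 1.0
print(separation(prototypes))         # 16.0
```

The compactness function still works with a single prototype, whereas the separation function raises an error, which reflects the different behaviour of the two principles for $c = 1$.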





































5.3 Categorization Consistency Principle 

If the categorization input $(X, U, \underline{X}, Sim_X)$ and its corresponding categorization result $(Y, V, \underline{Y}, Sim_Y)$ satisfy UCR, the categorization error is zero. However, even for human recognition systems, UCR cannot always be guaranteed to be true. Generally, human recognition systems try to make the categorization error as small as possible. Therefore, UCR is the most demanding requirement for categorization. If UCR does not hold, a reasonable categorization criterion should make UCR hold as closely as possible, which results in the categorization consistency principle as follows:

Categorization Consistency Principle: When UCR does not hold, a good categorization result should make UCR as approximately correct as possible.

When UCR does not hold, the categorization consistency principle can be used to design a categorization criterion as follows:

Categorization Consistency Criterion: $J_E : \{X, \underline{X}, \vec{X}, \tilde{X}\} \times \{Y, \underline{Y}, \vec{Y}, \tilde{Y}\} \mapsto R_+$ is called a categorization consistency criterion if the optimum of $J_E(X, \underline{X}, \vec{X}, \tilde{X}, Y, \underline{Y}, \vec{Y}, \tilde{Y})$ corresponds to the categorization result with the minimum difference between $(\underline{X}, \vec{X}, \tilde{X})$ and $(\underline{Y}, \vec{Y}, \tilde{Y})$.

Clearly, if UCR cannot be true, the categorization consistency principle should be the first principle when designing a categorization algorithm, no matter what the number of categories is. Frankly speaking, it is not usually expected that $(\underline{X}, Sim_X)$ and $(\underline{Y}, Sim_Y)$ are obtained simultaneously. Usually, $(\underline{X}, Sim_X)$ is interchanged with or approximated by $(\underline{Y}, Sim_Y)$ when designing a categorization algorithm. In many categorization algorithms, UCR is supposed to be true but is not actually true. Under such an assumption, the category compactness principle and the category separation principle should be used to design categorization methods.
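A very simple instance of a consistency criterion can be sketched in Python. It covers only the outer referring component of UCR, namely the difference between $\vec{X}$ and $\vec{Y}$, and measures it as the fraction of objects whose assigned categories disagree. This is an assumed, deliberately restricted choice; a full criterion would also compare the cognitive representations and the similarity operators. The function name `outer_disagreement` is hypothetical.

```python
from typing import List


def outer_disagreement(vec_x: List[int], vec_y: List[int]) -> float:
    """Fraction of objects k for which the input and output assignments differ.

    vec_x[k] and vec_y[k] are single-valued category indices of the same object.
    A value of 0.0 means the outer referring part of UCR holds exactly.
    """
    if len(vec_x) != len(vec_y):
        raise ValueError("input and output must describe the same objects")
    if not vec_x:
        raise ValueError("at least one object is required")
    mismatches = sum(1 for a, b in zip(vec_x, vec_y) if a != b)
    return mismatches / len(vec_x)


# Toy example (illustrative values): one of four objects is assigned differently.
vec_x = [1, 1, 2, 2]
vec_y = [1, 2, 2, 2]
print(outer_disagreement(vec_x, vec_y))  # 0.25
```

When the input memberships are given a priori, this quantity coincides with the usual error rate, which is zero exactly when the assignments agree on every object.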

5.4 Occam’s razor 

For a specific categorization problem, there exist many categorization models. The category compactness principle, the category separation principle and the categorization consistency principle only select the optimal parameters among candidate models with the same inner category representation; they cannot choose the optimal model among different inner category representations. How can an appropriate categorization model be selected among different inner category representations? Occam's razor is a popular tool for human beings to choose models among different representations; it states that "plurality should not be posited without necessity". Therefore, a simpler categorization model should be selected among candidate models with the same performance.

What is a simple categorization model? As the categorization problem can be represented by the categorization input $(X, U, \underline{X}, Sim_X)$ and the corresponding categorization output $(Y, V, \underline{Y}, Sim_Y)$, a model with a simple categorization input and output will be considered simple. When $c = 1$, $\forall k, \vec{x}_k = 1$ and $\forall k, \tilde{x}_k = 1$. Therefore, it is enough to study $\underline{X}$ and $\underline{Y}$ in order to obey UCR or its approximate version, the categorization consistency principle; $(U, Sim_X, V, Sim_Y)$ can be omitted when designing a categorization model. If such an assumption holds, the problem can be considered a simple categorization problem. Otherwise, if $c \geq 2$, assume $Y = X$; $V$ can be replaced by $Sim_Y$ because CE always holds, hence $(Y, V, \underline{Y}, Sim_Y)$ can be represented by $(\underline{Y}, Sim_Y)$. Similarly, $(X, U, \underline{X}, Sim_X)$ can be represented by $(X, U)$. In this case, it is enough to deal with $(X, U, \underline{Y}, Sim_Y)$ for such a categorization problem. Clearly, this is also a simple categorization case. Of course, such simplified categorization models can be further simplified by selecting a simpler $\underline{Y}$. In summary, Occam's razor can be used to discuss categorization model complexity. In the following, we study categorization models according to model complexity from the point of view of Occam's razor.

6 Applications 

In this section, we study categorization models according to the analysis in section 5. When $c = 1$, categorization becomes a one-category problem, which includes density estimation, regression and some dimensionality reduction methods. When $c > 1$, categorization is a multiple-category problem, which includes clustering and classification. When $c > 1$ and $U$ is not known before categorization, categorization is a clustering problem; when $c > 1$ and $U$ is known before categorization, categorization is a classification problem. In the following, these issues are discussed based on the proposed axioms and principles.

6.1 Unsupervised Dimensionality Reduction 

In the following, we will give several examples to show 
how to interpret dimensionality reduction methods 
based on the proposed axioms and principles. 

For simplicity, assume that $X = [x_{kr}]_{n \times p}$ is sampled from some underlying structure in a space with dimensionality $p$, and that such a sample can also be represented by $Y = [y_{kr}]_{n \times d}$ in a low-dimensional space with dimensionality $d$, where $p \gg d$. Such a categorization problem is called dimensionality reduction.

If $U$ is not known, such a problem is called unsupervised dimensionality reduction. Unsupervised dimensionality reduction has the categorization input $(X, U, \underline{X}, Ds_X)$ and the categorization output $(Y, V, \underline{Y}, Ds_Y)$. Therefore, unsupervised dimensionality reduction can be considered a categorization problem. In this section, we further assume that $c = 1$. Under this assumption, $\vec{X} = \vec{Y}$ and $\tilde{X} = \tilde{Y}$, so UCR only requires that $\underline{X} = \underline{Y}$. If UCR does not hold, the categorization consistency principle naturally requires that $\underline{X}$ approximates $\underline{Y}$ as closely as possible. If UCR does hold, the category compactness principle implies that the best $\underline{X}$ should make the underlying category the most compact.


PCA (Pearson, 1901; Hotelling, 1933; Abdi and Williams, 2010): Let

$$\underline{X} = \underline{Y} = \begin{bmatrix} x_0 \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$$

represent the ordered orthonormal basis $\{w_1, w_2, \cdots, w_d\}$ with the origin $x_0$, and let $Y = [y_{kr}]_{n \times d}$ be the coordinates of the objects $O = \{o_1, o_2, \cdots, o_n\}$ in the ordered orthonormal basis $\{w_1, w_2, \cdots, w_d\}$ with the origin $x_0$. Then $w_i w_j^T = \delta_{ij}$, where $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$; $y_{kr} = (x_k - x_0) w_r^T$; and $x_0$, $w_i$ are $1 \times p$ vectors.


Let $Ds_X(x, \underline{X}) = \big(x - x_0 - \sum_i (x - x_0) w_i^T w_i\big)\big(x - x_0 - \sum_i (x - x_0) w_i^T w_i\big)^T$ represent the dissimilarity between $x$ and the category representation $\underline{X}$. It is easy to prove that $Ds_X(x, \underline{X}) = (x - x_0)(x - x_0)^T - \sum_i w_i (x - x_0)^T (x - x_0) w_i^T$. Obviously, if $x - x_0$ is a linear combination of the ordered orthonormal basis $\{w_1, w_2, \cdots, w_d\}$, then $Ds_X(x, \underline{X}) = 0$, which means that $x$ can be perfectly represented by $\underline{Y}$. If $\forall x_k$, $Ds_X(x_k, \underline{X}) = 0$, then every $x_k$ has coordinates in the ordered orthonormal basis $\{w_1, w_2, \cdots, w_d\}$ with the origin $x_0$ with zero residual. In general, it is not true that $\forall x_k$, $Ds_X(x_k, \underline{X}) = 0$.


As UCR holds, the category compactness principle is used to seek the best $\underline{X}$, which means that a good $\underline{X}$ should minimize the objective function (1) subject to $\forall i, j$, $w_i w_j^T = \delta_{ij}$.

$$\min_{\underline{X}} \sum_k Ds_X(x_k, \underline{X}) = \sum_k (x_k - x_0)(x_k - x_0)^T - \sum_i w_i \Big(\sum_k (x_k - x_0)^T (x_k - x_0)\Big) w_i^T \tag{1}$$


By the Lagrange multiplier method, the objective function can be rewritten as (2).

$$L = \sum_k (x_k - x_0)(x_k - x_0)^T - \sum_i w_i \sum_k (x_k - x_0)^T (x_k - x_0) w_i^T + \sum_i \lambda_i (w_i w_i^T - 1) \tag{2}$$

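The equations (3) can be obtained by differentiating (2).

$$\frac{\partial L}{\partial x_0} = -2 \sum_k (x_k - x_0)\Big(I_p - \sum_i w_i^T w_i\Big) = 0, \qquad \frac{\partial L}{\partial w_i} = -2 w_i \sum_k (x_k - x_0)^T (x_k - x_0) + 2 \lambda_i w_i = 0 \tag{3}$$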


Hence, the solution of minimizing (1) subject to $\forall i, j$, $w_i w_j^T = \delta_{ij}$ is given by (4).

$$x_0 = \frac{\sum_k x_k}{n}, \qquad w_i \sum_k (x_k - x_0)^T (x_k - x_0) = \lambda_i w_i \tag{4}$$

Equation (4) together with minimizing (1) yields traditional principal component analysis. The proposed axiomatic framework of categorization thus offers a new interpretation of principal component analysis.
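The PCA derivation above can be checked numerically. The following is a minimal sketch in plain Python, assuming a single basis vector ($d = 1$) and a tiny made-up data set; the helper names (`mean_row`, `scatter`, `top_direction`, `dissimilarity`) and the use of power iteration to solve the eigenvector condition in (4) are illustrative choices, not part of the original derivation.

```python
import math

def mean_row(X):
    """Origin x0 of equation (4): the mean of the rows of X."""
    n = len(X)
    return [sum(row[r] for row in X) / n for r in range(len(X[0]))]

def scatter(X, x0):
    """S = sum_k (x_k - x0)^T (x_k - x0), a p x p matrix."""
    p = len(x0)
    S = [[0.0] * p for _ in range(p)]
    for row in X:
        d = [row[r] - x0[r] for r in range(p)]
        for a in range(p):
            for b in range(p):
                S[a][b] += d[a] * d[b]
    return S

def top_direction(S, iters=200):
    """Power iteration (illustrative): a unit row vector w with w S = lambda w.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    p = len(S)
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        v = [sum(w[a] * S[a][b] for a in range(p)) for b in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        w = [c / norm for c in v]
    return w

def dissimilarity(x, x0, W):
    """Ds_X(x, X): squared residual of x - x0 after projecting
    onto the orthonormal rows of W."""
    d = [a - b for a, b in zip(x, x0)]
    res = list(d)
    for w in W:
        coef = sum(a * b for a, b in zip(d, w))
        res = [r - coef * c for r, c in zip(res, w)]
    return sum(r * r for r in res)

# Hypothetical data lying close to a line in the plane.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.2], [3.0, 2.8]]
x0 = mean_row(X)
S = scatter(X, x0)
w1 = top_direction(S)
total = sum(dissimilarity(x, x0, [w1]) for x in X)
print(x0, w1, total)
```

By objective (1), `total` equals the trace of `S` minus $w_1 S w_1^T$, so choosing the leading eigenvector leaves the smallest total dissimilarity among single directions.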


NMF (Lee and Seung, 1999): Let $Y = H = [h_{kr}]_{n \times d}$ and let

$$\underline{X} = \underline{Y} = W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}$$

represent the ordered basis $\{w_1, w_2, \cdots, w_d\}$, where $Y = [h_{kr}]_{n \times d}$ are the coordinates of the objects $O = \{o_1, o_2, \cdots, o_n\}$ in the ordered basis $\{w_1, w_2, \cdots, w_d\}$, all the elements of $w_i$ are nonnegative, and $\forall k, r$, $h_{kr}$ is nonnegative.


Let $Ds_X(x_k, \underline{X}) = \big(x_k - \sum_i h_{ki} w_i\big)\big(x_k - \sum_i h_{ki} w_i\big)^T$. As UCR holds, the category compactness principle is used to seek the best $\underline{X}$, which means that a good $\underline{X}$ should minimize the objective function (5).

$$\min_{\underline{X}} \sum_k Ds_X(x_k, \underline{X}) = \sum_k \Big(x_k - \sum_i h_{ki} w_i\Big)\Big(x_k - \sum_i h_{ki} w_i\Big)^T = \|X - HW\|_F^2 \tag{5}$$

Minimizing (5) introduces nonnegative matrix factorization (Lee and Seung, 1999).
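One way to minimize objective (5) under the nonnegativity constraints is by multiplicative updates of the kind associated with Lee and Seung. The sketch below is only an illustration in plain Python: the function names, the deterministic initialisation, the iteration count and the small constant `eps` are assumptions made here, not details taken from the text.

```python
def product(A, B):
    """Matrix product A B for list-of-lists matrices."""
    return [[sum(A[k][i] * B[i][r] for i in range(len(B)))
             for r in range(len(B[0]))] for k in range(len(A))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def objective(X, H, W):
    """Equation (5): squared Frobenius norm of X - H W."""
    HW = product(H, W)
    return sum((X[k][r] - HW[k][r]) ** 2
               for k in range(len(X)) for r in range(len(X[0])))

def nmf(X, d, iters=200, eps=1e-12):
    """Multiplicative updates keeping H (n x d) and W (d x p) nonnegative."""
    n, p = len(X), len(X[0])
    # Deterministic positive initialisation, chosen only for reproducibility.
    H = [[1.0 + 0.1 * ((k + i) % 3) for i in range(d)] for k in range(n)]
    W = [[1.0 + 0.1 * ((i + r) % 3) for r in range(p)] for i in range(d)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = product(X, Wt), product(product(H, W), Wt)
        H = [[H[k][i] * num[k][i] / (den[k][i] + eps) for i in range(d)]
             for k in range(n)]
        Ht = transpose(H)
        num, den = product(Ht, X), product(product(Ht, H), W)
        W = [[W[i][r] * num[i][r] / (den[i][r] + eps) for r in range(p)]
             for i in range(d)]
    return H, W

# Hypothetical rank-one nonnegative data, so d = 1 can fit it exactly.
X_demo = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
H, W = nmf(X_demo, d=1)
print(objective(X_demo, H, W))
```

Because every update multiplies a nonnegative entry by a nonnegative ratio, the constraints on $H$ and $W$ are preserved without any explicit projection step.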































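CCA (Hotelling, 1936): Let $\underline{X} = \dfrac{X a^T}{|X a^T|}$ and $\underline{Y} = \dfrac{Y b^T}{|Y b^T|}$, where $a$ is a $1 \times p$ vector and $b$ is a $1 \times d$ vector. However, $\underline{X} = \underline{Y}$ does not hold in general, so UCR is not true. Therefore, we should use the categorization consistency principle, which means minimizing the objective function (6).

$$\min L(\underline{X}, \underline{Y}) = |\underline{X} - \underline{Y}|^2 = \left| \frac{X a^T}{|X a^T|} - \frac{Y b^T}{|Y b^T|} \right|^2 = 2 - 2\,\frac{\langle X a^T, Y b^T \rangle}{|X a^T|\,|Y b^T|} \tag{6}$$

Obviously, minimizing (6) is equivalent to maximizing (7).

$$\frac{\langle X a^T, Y b^T \rangle}{|X a^T|\,|Y b^T|} = \frac{a X^T Y b^T}{\sqrt{a X^T X a^T}\,\sqrt{b Y^T Y b^T}} \tag{7}$$

Hence, canonical correlation analysis is introduced by maximizing (7).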
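The equivalence between the consistency loss and the normalized inner product rests on the identity $|u - v|^2 = 2 - 2\langle u, v \rangle$ for unit vectors $u$ and $v$. A small plain-Python check follows, with made-up data; the function names and the sample values of $X$, $Y$, $a$ and $b$ are hypothetical.

```python
import math

def matvec(M, a):
    """M a^T for a list-of-lists matrix M and a row vector a."""
    return [sum(m * c for m, c in zip(row, a)) for row in M]

def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def consistency_loss(X, a, Y, b):
    """Squared distance between the two normalized projections."""
    u = normalize(matvec(X, a))
    v = normalize(matvec(Y, b))
    return sum((p - q) ** 2 for p, q in zip(u, v))

def normalized_inner_product(X, a, Y, b):
    """Inner product of X a^T and Y b^T divided by the product of their norms."""
    u, v = matvec(X, a), matvec(Y, b)
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

# Hypothetical two-view data: X is 3 x 2, Y is 3 x 1.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
Y = [[2.0], [1.0], [4.0]]
a, b = [1.0, -0.5], [1.0]
print(consistency_loss(X, a, Y, b), normalized_inner_product(X, a, Y, b))
```

For centered data the normalized inner product is the sample correlation between the two projections, which is the quantity canonical correlation analysis maximizes over $a$ and $b$.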


LLE (Roweis and Saul, 2000): Let $\underline{X} = W_X = [w_{kl}]_{n \times n}$ and $Ds_X(x_k, \underline{X}) = Ds_X(x_k, W) = \big|x_k - \sum_{j \in N(k)} w_{kj} x_j\big|^2$, where $\sum_l w_{kl} = 1$, $w_{kl} \geq 0$, $w_{kl} = 0$ if $l \notin N(k)$, and $N(k) = \{j \mid x_j \text{ is a neighbor of } x_k\}$.

As UCR holds, the category compactness principle is used to seek the best $\underline{X}$. According to the category compactness principle, a good category representation $\underline{X} = W$ should minimize the objective function (8):

$$\min_W \sum_k Ds_X(x_k, W) = \sum_k \Big|x_k - \sum_{j \in N(k)} w_{kj} x_j\Big|^2 \tag{8}$$

According to UCR, $\underline{X} = \underline{Y}$ implies that $\underline{Y} = W$. Set $Ds_Y(y_k, \underline{Y}) = Ds_Y(y_k, W) = \big|y_k - \sum_{j \in N(k)} w_{kj} y_j\big|^2$. The category compactness principle tells us that a good $Y$ should minimize the objective function (9):

$$\min_Y \sum_k Ds_Y(y_k, W) = \sum_k \Big|y_k - \sum_{j \in N(k)} w_{kj} y_j\Big|^2 \tag{9}$$

In this way, the locally linear embedding algorithm results from minimizing (8) and (9).


MDS (Kruskal and Wish, 1978): Let $\underline{X} = D_X = [d^X_{kl}]_{n \times n}$ and $\underline{Y} = D_Y = [d^Y_{kl}]_{n \times n}$, where $d^X_{kl} = |x_k - x_l|$ and $d^Y_{kl} = |y_k - y_l|$. It is easy to know that $\underline{X} = \underline{Y}$ cannot hold. Therefore, the categorization consistency principle is used, which requires that a good $Y$ should minimize the objective function (10).

$$\min L(\underline{X}, \underline{Y}) = L(D_X, D_Y) \tag{10}$$

Naturally, the multidimensional scaling (MDS) algorithm can be introduced by minimizing the objective function (10).

ISOMAP (Tenenbaum et al., 2000): Let $\underline{X} = D_X = [d^X_{kl}]_{n \times n}$ and $\underline{Y} = D_Y = [d^Y_{kl}]_{n \times n}$, where $d^X_{kl}$ represents the geodesic distance between $x_k$ and $x_l$, and $d^Y_{kl} = |y_k - y_l|$. It is impossible for $\underline{X} = \underline{Y}$ to hold. The categorization consistency principle requires minimizing (10). According to the above analysis, the multidimensional scaling (MDS) algorithm can be used to compute $Y$.

In this way, the ISOMAP algorithm is introduced.

6.2 Density Estimation

If $n$ points $x_1, x_2, \cdots, x_n$ are sampled from a random variable with unknown probability density function $f$, then $f$ is expected to be constructed from the observed data $X = \{x_1, x_2, \cdots, x_n\}$, which is called density estimation. $f$ is called the expected density function.

Set $X = Y$, $\underline{X} = f$, $\underline{Y} = \hat{f}$, $U = [1, 1, \cdots, 1]_{1 \times n}$ and $V = [1, 1, \cdots, 1]_{1 \times n}$. Density estimation can then be considered a categorization problem with the categorization input $(X, U, \underline{X}, Ds_X)$ and the categorization output $(Y, V, \underline{Y}, Ds_Y)$, i.e. density estimation is a categorization problem with only one category. In the following, $\hat{f}$ is called the density estimator.

Because all points belong to one category, $U = V$ and $\vec{X} = \vec{Y}$ hold. However, $\underline{X} \neq \underline{Y}$. Therefore, UCR does not hold.

One method of density estimation is parametric estimation. If $p(x)$ is supposed to belong to the distribution family $p(x|\theta)$, density estimation is transformed into estimating $\theta$. In other words, density estimation becomes parametric estimation. In this case, $\underline{X} = \theta$ and $Ds_X(x, \theta) = -\log(p(x|\theta))$. Let $\hat{\theta}$ be the estimate of $\theta$; then $\underline{Y} = \hat{\theta}$ and $Ds_Y(x, \hat{\theta}) = -\log(p(x|\hat{\theta}))$. Therefore, the category compactness principle requires minimizing the intra-category variance, which results in the objective function (11).

$$\min_{\hat{\theta}} \sum_{k=1}^{n} Ds_Y(x_k, \hat{\theta}) = \min_{\hat{\theta}} \sum_{k=1}^{n} -\log(p(x_k|\hat{\theta})) \tag{11}$$

It is easy to know that the maximum likelihood method is equivalent to minimizing (11).

For example, let $\forall k$, $x_k \in R^p$, $x \in R^p$: $p(x|\theta) = \dfrac{1}{\sqrt{(2\pi)^p \sigma^{2p}}} \exp\left[-\dfrac{(x - \mu)(x - \mu)^T}{2\sigma^2}\right]$, where $\theta = \{\mu, \sigma^{2p}\}$. According to Equation (11), the objective function (12) can be inferred.


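$$L = -\sum_{k=1}^{n} \log(p(x_k|\theta)) = \sum_{k=1}^{n} \left( \frac{(x_k - \mu)(x_k - \mu)^T}{2\sigma^2} + \log \sqrt{(2\pi)^p \sigma^{2p}} \right) \tag{12}$$

Minimizing (12) leads to the estimate $\hat{\theta} = \{\hat{\mu}, \hat{\sigma}^{2p}\}$, where $\hat{\mu} = \dfrac{\sum_{k=1}^{n} x_k}{n}$ and $\hat{\sigma}^2 = \dfrac{\sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T}{np}$.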
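The parametric case can be illustrated with a short plain-Python sketch. It assumes the isotropic Gaussian family with a single variance $\sigma^2$ shared by all $p$ coordinates, evaluates the negative log-likelihood of objective (12), and computes the closed-form minimizer. The function names and the sample points are hypothetical.

```python
import math

def neg_log_likelihood(X, mu, sigma2):
    """Objective (12) for an isotropic Gaussian with variance sigma2 per coordinate."""
    p = len(mu)
    total = 0.0
    for x in X:
        sq = sum((a - m) ** 2 for a, m in zip(x, mu))
        # log sqrt((2 pi)^p sigma^(2p)) = 0.5 * p * log(2 pi sigma^2)
        total += sq / (2.0 * sigma2) + 0.5 * p * math.log(2.0 * math.pi * sigma2)
    return total

def gaussian_mle(X):
    """Closed-form minimizer: sample mean and pooled per-coordinate variance."""
    n, p = len(X), len(X[0])
    mu = [sum(x[r] for x in X) / n for r in range(p)]
    sigma2 = sum((x[r] - mu[r]) ** 2 for x in X for r in range(p)) / (n * p)
    return mu, sigma2

# Hypothetical sample in R^2.
X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mu_hat, sigma2_hat = gaussian_mle(X)
print(mu_hat, sigma2_hat, neg_log_likelihood(X, mu_hat, sigma2_hat))
```

Perturbing either the mean or the variance away from the returned values increases the objective, which is the sense in which maximum likelihood realizes the category compactness principle here.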


Another method of density estimation is nonparametric estimation. In this method, less rigid assumptions are made about $f$. In the literature (Silverman, 1986), nonparametric density estimators include histograms, kernel density estimation, the k-nearest neighbor method, etc.

Clearly, the key problem for density estimation is to estimate the difference between $f$ and $\hat{f}$. In theory, the minimum difference between $f$ and $\hat{f}$ should be expected according to the categorization consistency principle. In the literature, theoretical conditions for $\hat{f} = f$ have been well studied from the limit point of view (Silverman, 1986).


6.3 Regression 

Generally, if $n$ points $(x_1, f(x_1)), (x_2, f(x_2)), \cdots, (x_n, f(x_n))$ are sampled from $(x, f(x))$ and $f$ is not known but is expected to be learned, such a problem is called regression. Usually, $f$ is called the expected regression function.

Set

$$X = \begin{bmatrix} x_1 & f(x_1) \\ x_2 & f(x_2) \\ \vdots & \vdots \\ x_n & f(x_n) \end{bmatrix}, \qquad Y = \begin{bmatrix} x_1 & F(x_1) \\ x_2 & F(x_2) \\ \vdots & \vdots \\ x_n & F(x_n) \end{bmatrix},$$

$\underline{X} = (x, f(x))$, $\underline{Y} = (x, F(x))$, where $F$ is called the predicted regression function, $U = [1, 1, \cdots, 1]_{1 \times n}$ and $V = [1, 1, \cdots, 1]_{1 \times n}$. Regression then has the categorization input $(X, U, \underline{X}, Ds_X)$ and the categorization output $(Y, V, \underline{Y}, Ds_Y)$. In other words, regression can be considered a categorization problem with only one category.

Because all points belong to one category, it is easy to prove that $U = V$ and $\vec{X} = \vec{Y}$. However, $\underline{X} \neq \underline{Y}$ in general. Therefore, UCR does not hold. According to the categorization consistency principle, a good category representation $\underline{Y}$ should minimize the following objective function:

$$|\underline{X} - \underline{Y}| = D(f(x), F(x)) \tag{13}$$

It is impossible to directly compute $D(f(x), F(x))$ as $f$ is unknown. Therefore, different definitions of $D(f(x), F(x))$ lead to different regression algorithms.

For example, set $f(x) \in R$ and $F(x) = w x^T + b$. Assume that the dimensionality of $x$ is $r$.

If $D(f(x), F(x)) = \sum_{k=1}^{n} \|f(x_k) - F(x_k)\|^2$, linear regression is obtained by minimizing (13) when $n \gg r$. When $n \ll r$, many feasible solutions reach the same minimum of (13), because $n \ll r$ implies that minimizing (13) is a singular problem.

How should the optimal solution be selected from the many feasible solutions of minimizing (13)? A natural idea is to select the feasible solution with minimum norm.

If the Euclidean norm is used, $D(f(x), F(x))$ can be defined by $\sum_{k=1}^{n} \|f(x_k) - F(x_k)\|^2 + \lambda \|w\|^2$. Hence, ridge regression is obtained by minimizing (13).

If the $L_1$ norm is used, $D(f(x), F(x))$ can be defined by $\sum_{k=1}^{n} \|f(x_k) - F(x_k)\|^2 + \lambda \|w\|_{L_1}$. In this way, Lasso regression is obtained by minimizing (13) (Tibshirani, 1994).
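The role of the norm penalty when $n \ll r$ can be seen on a toy problem with one sample and two features. This is a minimal sketch in plain Python, assuming no intercept $b$ for brevity; the helper names and the data are hypothetical, and the linear solver is a generic Gaussian elimination rather than anything prescribed by the text.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][c] * w[c] for c in range(i + 1, n))) / M[i][i]
    return w

def ridge(X, y, lam):
    """Minimise sum_k (y_k - w x_k^T)^2 + lam * ||w||^2 (intercept omitted)."""
    r = len(X[0])
    A = [[sum(x[a] * x[c] for x in X) + (lam if a == c else 0.0)
          for c in range(r)] for a in range(r)]
    rhs = [sum(x[a] * t for x, t in zip(X, y)) for a in range(r)]
    return solve(A, rhs)

def ridge_objective(X, y, w, lam):
    fit = sum((t - sum(a * c for a, c in zip(w, x))) ** 2 for x, t in zip(X, y))
    return fit + lam * sum(c * c for c in w)

# One sample, two features: every w with w1 + w2 = 2 fits the data exactly.
X = [[1.0, 1.0]]
y = [2.0]
lam = 1e-3
w_ridge = ridge(X, y, lam)
print(w_ridge, ridge_objective(X, y, w_ridge, lam))
```

Both $(2, 0)$ and $(0, 2)$ fit the single sample exactly, yet the penalized objective prefers a solution near $(1, 1)$, the feasible solution with the smallest Euclidean norm.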


6.4 Clustering 


For clustering, $(X, U, \underline{X}, Sim_X)$ is called the clustering input and $(Y, V, \underline{Y}, Sim_Y)$ is called the clustering result. Since $U$ and $V$ are unknown a priori for clustering, it is always supposed that the inner input and the corresponding inner output are the same, which means that $(\underline{X}, Sim_X) = (\underline{Y}, Sim_Y)$. Under that assumption, it is assumed that $U = V$ for clustering.

When $Y = X$, the outer input and the outer output are the same, which implies that $(X, U, \underline{X}, Sim_X)$ and $(Y, V, \underline{Y}, Sim_Y)$ are exchangeable with respect to clustering. In short, $(X, U, \underline{X}, Sim_X)$ also represents the clustering result. As $Sim_X$ and $Sim_Y$ are the same, $Sim$ can denote both $Sim_X$ and $Sim_Y$ for clustering. Hence, the theoretical analysis of clustering in Yu and Xu (2014) is also true under the new categorization interpretation of this paper.

Even if $Y \neq X$, $(U, \underline{X}, \vec{X}) = (V, \underline{Y}, \vec{Y})$ also holds for clustering, which means that ECR and UCR are still true. In other words, ECR and UCR can always be omitted for clustering, so that SS, CS and CE play a more important role for clustering. Frankly speaking, SS, CS and CE are enough for clustering. Of course, when $Y \neq X$, such clustering algorithms usually have a feature extraction step, as in spectral clustering.


6.5 Classification 

For classification, a category is called a class. In order to be consistent with the literature, $(X, U, \underline{X}, Sim_X)$ is called the classification training input and the categorization result $(Y, V, \underline{Y}, Sim_Y)$ is called the classification training output in this section. More specifically, $(X, U)$ is called the training set, $(\underline{X}, Sim_X)$ is called the expected classifier, $(Y, V)$ is called the training result, and $(\underline{Y}, Sim_Y)$ is called the learned classifier. ECR and the categorization axioms are usually true for classification. However, UCR is usually not true.

If UCR is true, the classification error is zero. In practice, a classification method can only make its classification result reach the minimum classification error, and usually this classification error is not zero. Therefore, UCR should serve as a constraint for a classification problem. In other words, when dealing with a classification problem, UCR should hold as much as possible in probability.


When $U$ is a proper partition, the corresponding classification problem is a standard classification problem. When $U$ is an overlapping partition, the corresponding classification problem is a multi-label classification problem. For multi-label classification, SS should be generalized as $\forall k \exists i (i \in \tilde{x}_k)$. Under such a generalization, multi-label classification also follows SS.

When a classification result $(Y, V, \underline{Y}, Sim_Y)$ is output, we can predict which category a new object should be assigned to. In theory, the decision region for a classification result $(Y, V, \underline{Y}, Sim_Y)$ can be defined as follows:

Decision Region: $\Omega = \{x \mid \exists i (\tilde{y} = i) \wedge (y = \theta(x))\}$.

In particular, the decision region for a class $\underline{Y_i}$ can be defined as follows:

Decision Region for a Class $\underline{Y_i}$: $\Omega_i = \{x \mid (\tilde{y} = i) \wedge (y = \theta(x))\}$.

Therefore, it is easy to know that $\cup_i \Omega_i = \Omega$.

The boundary for a classification result $(Y, V, \underline{Y}, Sim_Y)$ can be defined as follows:

Boundary: $\partial\Omega = \overline{\Omega} - \Omega^{\circ}$, where $\overline{\Omega}$ represents the closure of $\Omega$ and $\Omega^{\circ}$ represents the interior of $\Omega$.

The training decision region can be defined as follows:

Training Decision Region: $\Omega_{(\underline{Y}, Sim_Y)} = \{x \mid \exists i \exists k ((x \in \Omega_i) \wedge (x_k \in \Omega_i) \wedge (Sim_Y(\theta(x), \underline{Y_i}) \geq Sim_Y(\theta(x_k), \underline{Y_i})))\}$.

Training Decision Region for a class $\underline{Y_i}$: $\Omega_{\underline{Y_i}} = \{x \mid \exists k ((x \in \Omega_i) \wedge (x_k \in \Omega_i) \wedge (Sim_Y(\theta(x), \underline{Y_i}) \geq Sim_Y(\theta(x_k), \underline{Y_i})))\}$.

The support vector for a classification result $(Y, V, \underline{Y}, Sim_Y)$ can be defined as follows:

Support Vector: If $x_k \in \partial\Omega_{(\underline{Y}, Sim_Y)}$, then $x_k$ is called a support vector for the classification result $(Y, V, \underline{Y}, Sim_Y)$.

The margin for a classification result $(Y, V, \underline{Y}, Sim_Y)$ can be defined as follows:

Margin: $Margin_{(\underline{Y}, Sim_Y)} = \min_{i \neq j} d(\Omega_{\underline{Y_i}}, \Omega_{\underline{Y_j}})$, where $d(\Omega_{\underline{Y_i}}, \Omega_{\underline{Y_j}})$ represents the distance between $\Omega_{\underline{Y_i}}$ and $\Omega_{\underline{Y_j}}$.

Transparently, the decision region is used to judge which category an object should be assigned to, and the training decision region is used to judge the quality of the classification result.

6.5.1 Regression based Classification

In the literature, one common idea of designing a clas¬ 
sification algorithm is to transform classification to re¬ 
gression. In order to do this, regression function needs 
to be defined. In the following, we will do this accord¬ 
ing to the proposed axiomatic framework. 

The expected regression function can be defined as $p(k) = \vec{x}_k$, where $U$ is a proper partition. Under this circumstance, CE states that $p(k) = \tilde{x}_k$ holds for a classification result. Similarly, when $V$ is a proper partition, we set $H(k) = \vec{y}_k$; then CE guarantees that $H(k) = \tilde{y}_k$ holds.

Generally speaking, $x$ denotes the input object representation and $y$ denotes the corresponding output object representation. As $y = \theta(x)$ and $p(x)$ denotes $\vec{x}$, the predicted regression function can be defined as $h(x) = H(\theta(x)) = H(y) = \vec{y}$, i.e. $h(x)$ represents the predicted label. With $\underline{X} = (x, p(x))$ and $\underline{Y} = (x, h(x))$, classification can be considered regression.


Using this notation, UCR requires that $\underline{X} = \underline{Y}$, which means $\forall x (p(x) = h(x))$. In practice, this is impossible, as $p(x)$ is not known a priori and only $p(x_k)$ is known for $k \in \{1, 2, \cdots, n\}$. Therefore, it is natural to relax $\forall x (p(x) = h(x))$ to $P(p(x) \neq h(x)) \leq \varepsilon$. PAC theory has provided a theoretical investigation of sufficient conditions for $P(p(x) \neq h(x)) \leq \varepsilon$ to hold with a probability not less than $1 - \delta$ (Valiant, 1984).


Therefore, UCR is very important for classification. When developing a classification method, the categorization consistency principle requires that $\sum_{k=1}^{n} L(p(x_k), h(x_k))$ reaches its minimum, which is usually called minimizing the empirical risk. Transparently, neural networks can be introduced by minimizing the empirical risk. Usually, the more complex $h(x)$ is, the smaller the empirical risk. Therefore, the tradeoff between the empirical risk and the function complexity leads to the structural risk (Vapnik, 2000).


In particular, when $c = 2$, $p(x) \in \{1, 2\}$. Set $h(x) = 1 + \pi(x)$ and $L(p(x), h(x)) = -(p(x) - 1)\log(h(x) - 1) - (2 - p(x))\log(2 - h(x)) = -(p(x) - 1)\log(\pi(x)) - (2 - p(x))\log(1 - \pi(x))$, where $\pi(x) = \dfrac{\exp(w x^T + b)}{1 + \exp(w x^T + b)}$. Equation (13) tells us that the objective function of the binomial logistic regression model (Hosmer Jr and Lemeshow, 2004) can be expressed as follows:

$$\min_{w, b} \sum_{k=1}^{n} L(p(x_k), h(x_k)) = -\sum_{k=1}^{n} (p(x_k) - 1)(w x_k^T + b) + \sum_{k=1}^{n} \log(1 + \exp(w x_k^T + b)) \tag{14}$$
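The two forms of the binomial logistic loss can be compared numerically, and objective (14) can be decreased by plain gradient descent. The following is a minimal plain-Python sketch; the function names, the learning rate, the number of steps and the toy data are assumptions made for illustration, and gradient descent is only one possible minimizer.

```python
import math

def loss(w, b, X, labels):
    """Equation (14); labels p(x_k) take values in {1, 2}."""
    total = 0.0
    for x, p in zip(X, labels):
        z = sum(a * c for a, c in zip(w, x)) + b
        total += -(p - 1) * z + math.log1p(math.exp(z))
    return total

def cross_entropy(w, b, X, labels):
    """The same loss written with pi(x) = exp(z) / (1 + exp(z))."""
    total = 0.0
    for x, p in zip(X, labels):
        z = sum(a * c for a, c in zip(w, x)) + b
        pi = math.exp(z) / (1.0 + math.exp(z))
        total += -(p - 1) * math.log(pi) - (2 - p) * math.log(1.0 - pi)
    return total

def gradient_step(w, b, X, labels, lr):
    """One gradient descent step on equation (14)."""
    gw, gb = [0.0] * len(w), 0.0
    for x, p in zip(X, labels):
        z = sum(a * c for a, c in zip(w, x)) + b
        err = math.exp(z) / (1.0 + math.exp(z)) - (p - 1)
        gw = [g + err * c for g, c in zip(gw, x)]
        gb += err
    return [a - lr * g for a, g in zip(w, gw)], b - lr * gb

# Hypothetical one-dimensional, non-separable training set.
X = [[0.0], [1.0], [2.0], [3.0]]
labels = [1, 2, 1, 2]
w, b = [0.0], 0.0
initial = loss(w, b, X, labels)
for _ in range(50):
    w, b = gradient_step(w, b, X, labels, lr=0.1)
print(initial, loss(w, b, X, labels))
```

The label encoding $p(x) \in \{1, 2\}$ follows the text, so $p(x) - 1$ plays the role of the usual 0/1 target.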


6.5.2 Classification for X = Y 


However, many classification methods are not developed by transforming classification into regression. In order to show this clearly, we simply assume $Y = X$; the classification result then omits $Y$, as $X$ is known a priori. By the analysis in Section 5.4, it is enough to study $(X, U, \underline{Y}, Sim_Y)$ under this simplification. Since $(X, U)$ is known for classification, the simplest $\underline{Y}$ should be preferred according to Occam's razor. In the following, $U = [u_{ik}]_{c \times n}$ is a hard partition.


Example 1: It is the simplest to set $\underline{Y} = \underline{X}$, which means that $\forall i$, $\underline{Y_i} = \underline{X_i}$. Under this assumption, we do not know any essential information about $\underline{Y}$ except for $X$. When $\forall i$, $\underline{Y_i} = X_i$, it is natural to set $Sim_Y(y, \underline{Y_i}) = Sim_Y(x, \underline{Y_i}) = \dfrac{|N_i(x)|}{K}$, where $N_i(x) = \{x_l \mid x_l \in X_i \wedge x_l \in K\text{-nearest neighborhood of } x\}$. Under the above assumption, the K-nearest neighbor classification method (Cover and Hart, 1967) is introduced. It is easy to know that the categorization result of K-nearest neighbor classification follows the categorization axioms in general. Clearly, UCR does not hold for K-nearest neighbor classification in general.

Example 2: Let $X = [x_{kr}]_{n \times p}$, $Sim_Y(y, \underline{Y_i}) = Sim_Y(x, \underline{Y_i})$ and let $g_i(x) = \log Sim_Y(x, \underline{Y_i})$ be the discriminant function. SS requires that object $x$ is assigned to class $\underline{Y_i}$ if $g_i(x) = \max_j g_j(x)$. Occam's razor states that a simpler $\underline{Y}$ is preferred. In theory, if $\forall i$, $\underline{Y_i}$ is represented by $(w_i, w_{i0})$, where $w_i$ is a $1 \times p$ vector and $w_{i0} \in R$, then $g_i(x) = \log Sim_Y(x, \underline{Y_i}) = w_i x^T + w_{i0}$. Such a categorization model is simpler and is called linear discriminant analysis (Fisher, 1936). Transparently, linear discriminant analysis also satisfies the categorization axioms.

Example 3: In particular, when $c = 2$, it is natural to set $\forall i$, $\underline{Y_i} = (w_i, w_{i0})$. Occam's razor states that fewer parameters should be preferred. If we set $\underline{Y_1} = (w, b - 1)$ and $\underline{Y_2} = (-w, -b - 1)$, the number of free parameters is the least. Therefore, $\underline{Y_1} = (w, b - 1)$ and $\underline{Y_2} = (-w, -b - 1)$ are the simplest linear classification representation according to Occam's razor. In this case, $g_1(x) = \log Sim_Y(x, \underline{Y_1}) = w x^T + b - 1$ and $g_2(x) = \log Sim_Y(x, \underline{Y_2}) = -w x^T - b - 1$. If $w x_k^T + b - 1 \geq 0$ for $\forall x_k \in X_1$ and $-w x_k^T - b - 1 \geq 0$ for $\forall x_k \in X_2$, the categorization axioms hold. Therefore, the category separation principle states that the optimal linear discrimination should keep the distance between the two parallel hyperplanes as large as possible when UCR holds, which leads to the famous support vector machine.

It is easy to know that the training decision region for the support vector machine is $\Omega_{(\underline{Y}, Sim_Y)} = \{x \mid w x^T + b - 1 \geq 0 \text{ for } \forall x_k \in X_1 \text{ and } -w x^T - b - 1 \geq 0 \text{ for } \forall x_k \in X_2\}$. It is easy to prove that $Margin_{(\underline{Y}, Sim_Y)} = \dfrac{2}{\sqrt{w w^T}}$. A larger $Margin_{(\underline{Y}, Sim_Y)}$ means better generalization for the support vector machine, which has been proved by statistical learning theory (Vapnik, 2000).

Example 4: Let $\underline{Y_i} = (w_i, w_{i0})$ where $1 \leq i \leq c - 1$ but $\underline{Y_c}$ is unknown, and let $Sim_Y(x, \underline{Y_i}) = \dfrac{\exp(w_i x^T + w_{i0})}{1 + \sum_{j=1}^{c-1} \exp(w_j x^T + w_{j0})}$ if $1 \leq i \leq c - 1$, and $Sim_Y(x, \underline{Y_c}) = \dfrac{1}{1 + \sum_{j=1}^{c-1} \exp(w_j x^T + w_{j0})}$. According to the category compactness principle, we should maximize the objective function, which can be expressed as follows:

$$\max_{w_i, w_{i0}} \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik} \log Sim_Y(x_k, \underline{Y_i}) = \sum_{k=1}^{n} \sum_{i=1}^{c-1} u_{ik} (w_i x_k^T + w_{i0}) - \sum_{k=1}^{n} \log\Big(1 + \sum_{i=1}^{c-1} \exp(w_i x_k^T + w_{i0})\Big) \tag{15}$$

Such a categorization model is called logistic regression (Cox, 1958). According to Occam's razor, logistic regression is more complex than linear discriminant analysis. When $c > 2$, logistic regression should not be considered a regression model, as no regression function can be defined. Moreover, the $c$-th class can be considered noise in logistic regression.


Example 5: For a categorization model, we do not need a concrete form of $\forall i$, $\underline{Y_i}$ explicitly. No matter how complicated $\underline{Y}$ is, it is enough to compute $Sim_Y$. If $Sim_Y(y, \underline{Y_i}) = Sim_Y(x, \underline{Y_i}) = P(x, Y_i)$ and $v_{ik} = P(Y_i|x_k)$, it is easy to know that the Bayes classifier almost follows the categorization axioms, as the output $y = x \in Y_i$ just because $Sim_Y(x, \underline{Y_i}) = \max_j Sim_Y(x, \underline{Y_j}) = \max_j P(x, Y_j) = P(x, Y_i)$, and Bayes' theorem guarantees that $\arg\max_i P(x, Y_i) = \arg\max_i P(Y_i|x)$. Therefore, it is very important for the Bayes classifier to estimate $Sim_Y$ or $V$ from $(X, U)$.






























In particular, assume that $X = [x_{kr}]_{n \times p}$ represents $n$ objects and $x = [x_{*1}, x_{*2}, \cdots, x_{*p}]$ represents an object, where $x_{*r}$ is the $r$-th feature. According to the categorization axioms, it is enough to calculate $\max_i P(x, Y_i)$ in order to classify $x$. According to Occam's razor, we should select the simplest way to calculate $P(x, Y_i)$. The simplest way to estimate $P(x|Y_i)$ is to assume that each feature is conditionally independent of every other feature given $Y_i$, so that $P(x|Y_i) = \prod_{r=1}^{p} P(x_{*r}|Y_i)$. Let $P(Y_i) = \dfrac{card(X_i)}{n}$; then $Sim_Y(x, \underline{Y_i})$ can be computed by $P(Y_i) \prod_{r=1}^{p} P(x_{*r}|Y_i)$. Based on the above analysis, the naive Bayes classifier (Duda et al., 1973) can classify $x$ according to the categorization axioms. Therefore, the naive Bayes classifier is the simplest Bayes classifier with respect to Occam's razor. As $v_{ik} = P(Y_i|x_k)$ can be computed and $V$ is a probability partition, the Bayes classifier can be considered soft categorization.
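The naive Bayes similarity above can be sketched for categorical features using relative frequencies. This is an illustrative plain-Python sketch only: the function names and the toy weather-style data are hypothetical, and no smoothing is applied, so an unseen feature value yields a similarity of zero.

```python
def naive_bayes_similarity(X, labels, x, cls):
    """Sim_Y(x, Y_i) = P(Y_i) * prod_r P(x_r | Y_i), estimated by counting."""
    members = [row for row, lab in zip(X, labels) if lab == cls]
    sim = len(members) / len(X)            # P(Y_i) = card(X_i) / n
    for r, value in enumerate(x):
        matches = sum(1 for row in members if row[r] == value)
        sim *= matches / len(members)      # P(x_r | Y_i), no smoothing
    return sim

def classify(X, labels, x):
    """Assign x to the class with the largest similarity (the SS axiom)."""
    classes = sorted(set(labels))
    return max(classes, key=lambda c: naive_bayes_similarity(X, labels, x, c))

# Hypothetical training set with two categorical features and two classes.
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = [1, 1, 2, 2]
query = ("sunny", "mild")
print(classify(X, labels, query))
```

Dividing each similarity by the sum over classes would give the posterior $P(Y_i|x)$, that is, the soft membership $v_{ik}$ mentioned in the text.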


Example 6: Let $Ds_Y(y, \underline{Y_i}) = Ds_Y(x, \underline{Y_i}) = R(\alpha_i|x) = \sum_{j=1}^{c} \lambda_{ij} P(Y_j|x)$, where the action $\alpha_i$ denotes the decision to assign the output $y$ to class $Y_i$ and $\lambda_{ij}$ denotes the cost incurred for taking the action $\alpha_i$ when the input $x$ belongs to $Y_j$. Transparently, the categorization result of minimum risk classification almost abides by the categorization axioms.


Example 7: Let $Sim_Y(y, \underline{Y_i}) = Sim_Y(x, \underline{Y_i}) = U(\alpha_i|x) = \sum_{j=1}^{c} U_{ij} P(Y_j|x)$, where the action $\alpha_i$ denotes the decision to assign the output $y$ to class $Y_i$ and $U_{ij}$ measures how good it is to take the action $\alpha_i$ when the input $x$ belongs to $Y_j$. The maximum expected utility classifier also almost follows the categorization axioms.
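Examples 6 and 7 differ only in whether a dissimilarity (risk) is minimized or a similarity (utility) is maximized. A minimal plain-Python sketch follows, assuming the posterior probabilities $P(Y_j|x)$ are already available; the function names, the cost matrix and the posterior values are hypothetical.

```python
def conditional_risk(cost, posterior, i):
    """R(alpha_i | x) = sum_j lambda_ij * P(Y_j | x)."""
    return sum(cost[i][j] * posterior[j] for j in range(len(posterior)))

def min_risk_action(cost, posterior):
    """Index i of the action alpha_i with the smallest conditional risk."""
    return min(range(len(cost)), key=lambda i: conditional_risk(cost, posterior, i))

def expected_utility(utility, posterior, i):
    """U(alpha_i | x) = sum_j U_ij * P(Y_j | x)."""
    return sum(utility[i][j] * posterior[j] for j in range(len(posterior)))

def max_utility_action(utility, posterior):
    """Index i of the action alpha_i with the largest expected utility."""
    return max(range(len(utility)),
               key=lambda i: expected_utility(utility, posterior, i))

# Hypothetical posterior and an asymmetric cost: choosing class 0 when the
# truth is class 1 is ten times worse than the opposite mistake.
posterior = [0.7, 0.3]
cost = [[0.0, 10.0], [1.0, 0.0]]
zero_one = [[0.0, 1.0], [1.0, 0.0]]
utility = [[-c for c in row] for row in cost]
print(min_risk_action(cost, posterior), min_risk_action(zero_one, posterior))
```

With the zero-one cost the decision reduces to picking the most probable class, while the asymmetric cost shifts the decision to the less probable class. Setting the utility to the negated cost makes the two classifiers agree.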


Example 8: In the above examples, $\forall i$, $\underline{Y_i}$ is represented by one unique prototype, whether implicit or explicit. If we assume that $\forall i$, $\underline{Y_i}$ can be represented by several prototypes, such a classifier is more complex. In a decision tree classifier, $\forall i$, $\underline{Y_i}$ is usually represented by several mutually exclusive rules. It can be proved that the decision tree classifier also follows the categorization axioms.


6.5.3 Classification for X ≠ Y

When $X \neq Y$ with $p > d$, supervised dimensionality reduction is proposed to deal with the corresponding categorization. When $X \neq Y$ with $p < d$, kernel methods are proposed for categorization. In the following, we discuss them in turn.

Supervised Dimensionality Reduction 


For $X \neq Y$ with $p > d$, it is easy to know that $y = \theta(x)$ such that $\forall k$, $y_k = \theta(x_k)$. The simplest $\theta$ is a projection mapping. If $\theta(\cdot)$ is a projection mapping, supervised dimensionality reduction becomes feature selection. Feature selection methods can be easily interpreted by the categorization consistency principle.

If $\theta$ is not a projection mapping, the simplest $\theta$ is a linear mapping from $R^p$ to $R$. If there exists a direction $w$ such that all categories in $(X, U)$ are linearly separable when all points in $(X, U)$ are orthogonally projected onto the direction $w$, we set $\underline{Y_i} = \underline{X_i} = v_i w^T w$, where $w$ is a $1 \times p$ vector, $Y = [z_k]_{n \times 1}$, $w w^T = 1$, $z_k = x_k w^T$ and $v_i = \dfrac{\sum_{x_k \in X_i} x_k}{|X_i|}$. With $Ds_X(x, \underline{X_i}) = (x w^T w - \underline{X_i})(x w^T w - \underline{X_i})^T = w (x - v_i)^T (x - v_i) w^T$ and $Ds_Y(z, \underline{Y_i}) = (z w - \underline{Y_i})(z w - \underline{Y_i})^T$, it is easy to know that $Ds_X(x, \underline{X_i}) = Ds_Y(z, \underline{Y_i})$.

According to the category compactness principle, we need to minimize $\sum_{i=1}^{c} \sum_{x_k \in X_i} Ds_X(x_k, \underline{X_i}) = n w S_W w^T$. According to the category separation principle, we need to maximize $\sum_{i=1}^{c} |X_i| w (v_i - \bar{x})^T (v_i - \bar{x}) w^T = n w S_B w^T$, where $\bar{x} = n^{-1} \sum_{k=1}^{n} x_k$. Combining the above two functions, $\dfrac{w S_W w^T}{w S_B w^T}$ should be minimized, which leads to the generalized Fisher linear discriminant analysis.

In particular, when $c = 2$, it is easy to prove that $(\underline{X_1} - \underline{X_2})(\underline{X_1} - \underline{X_2})^T = w (v_1 - v_2)^T (v_1 - v_2) w^T = w S_b w^T$. Since

$$|X_1| w (v_1 - \bar{x})^T (v_1 - \bar{x}) w^T + |X_2| w (v_2 - \bar{x})^T (v_2 - \bar{x}) w^T = \frac{|X_1||X_2|^2}{|X|^2} w (v_1 - v_2)^T (v_1 - v_2) w^T + \frac{|X_1|^2 |X_2|}{|X|^2} w (v_1 - v_2)^T (v_1 - v_2) w^T = \frac{|X_1||X_2|}{|X|} w (v_1 - v_2)^T (v_1 - v_2) w^T,$$

maximizing $w (v_1 - v_2)^T (v_1 - v_2) w^T$ is equivalent to maximizing $\sum_{i=1}^{2} |X_i| w (v_i - \bar{x})^T (v_i - \bar{x}) w^T$. Therefore, when $c = 2$, generalized Fisher linear discriminant analysis becomes Fisher linear discriminant analysis. Certainly, Fisher linear discriminant analysis follows UCR if $(X, U)$ is linearly separable in a direction $w$.
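The two-class identity and the compactness/separation ratio can be checked on a small example. The plain-Python sketch below assumes two-dimensional points and unit-norm directions; the function names and the data are hypothetical, and the quantities computed are the projected sums (that is, $n w S_W w^T$ and $n w S_B w^T$) rather than the scatter matrices themselves.

```python
import math

def mean(points):
    n = len(points)
    return [sum(pt[r] for pt in points) / n for r in range(len(points[0]))]

def project(w, v):
    """The scalar v w^T for row vectors v and w."""
    return sum(a * b for a, b in zip(w, v))

def within_scatter(w, classes):
    """sum_i sum_{x_k in X_i} w (x_k - v_i)^T (x_k - v_i) w^T."""
    total = 0.0
    for pts in classes:
        v = mean(pts)
        total += sum(project(w, [a - b for a, b in zip(x, v)]) ** 2 for x in pts)
    return total

def between_scatter(w, classes):
    """sum_i |X_i| w (v_i - xbar)^T (v_i - xbar) w^T."""
    xbar = mean([x for pts in classes for x in pts])
    total = 0.0
    for pts in classes:
        v = mean(pts)
        total += len(pts) * project(w, [a - b for a, b in zip(v, xbar)]) ** 2
    return total

def fisher_ratio(w, classes):
    """Within-class over between-class projected scatter (smaller is better)."""
    return within_scatter(w, classes) / between_scatter(w, classes)

# Hypothetical two-class data with unequal class sizes.
classes = [[[0.0, 0.0], [0.0, 1.0], [1.0, 0.5]], [[3.0, 0.0], [3.0, 1.0]]]
w_good = [1.0, 0.0]
w_diag = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
print(fisher_ratio(w_good, classes), fisher_ratio(w_diag, classes))
```

For two classes, `between_scatter` equals $\frac{|X_1||X_2|}{|X|}$ times the squared projected distance between the class means, which is the identity used in the text to recover Fisher linear discriminant analysis.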

Kernel Methods 


For $X \neq Y$ with $p < d$, assume that $Y$ is linearly separable but $X$ is not linearly separable. It is easy to know that $\theta(\cdot)$ is a nonlinear mapping such that $\forall k$, $y_k = \theta(x_k)$. Sometimes, the dimensionality of $Y$ is infinite. In this case, it is impossible to determine $\theta(\cdot)$ from $(X, U)$ and $(Y, V)$. Fortunately, when $(\underline{Y}, Sim_Y)$ is obtained, $(\underline{X}, Sim_X)$ can be obtained by the kernel function $K(x, x_k) = \langle \theta(x), \theta(x_k) \rangle$, where $\langle \theta(x), \theta(x_k) \rangle$ represents the inner product.

By defining $K(x, x_k)$, most categorization algorithms can be reinvented as kernel methods. Interested readers can refer to the article (Schölkopf and Smola, 2011).
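The point that $K$ can be evaluated without forming $\theta$ explicitly can be illustrated with a homogeneous quadratic kernel on two-dimensional inputs. This kernel and its explicit feature map are chosen here purely as an example and are not taken from the text.

```python
import math

def feature_map(x):
    """theta(x) for a 2-D input: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

def kernel(x, z):
    """K(x, z) = (x z^T)^2, computed in the input space only."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x, z = [1.0, 2.0], [3.0, -1.0]
print(kernel(x, z), inner(feature_map(x), feature_map(z)))
```

Both values agree up to rounding, so any algorithm that touches the mapped data only through inner products can work with `kernel` alone, which matters when the dimensionality of $Y$ is very large or infinite.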


In summary, classification models almost follow the categorization axioms, but different classification models have different model complexity. It should be pointed out that a complex model may be easy to interpret, whereas a simple one may be difficult to interpret. Sometimes, a simple categorization model is very difficult to discover, especially when it is not easy to interpret.

7 Discussion and Conclusions 


Yu and Xu (2014) have presented categorization axioms based on the assumption that any category should have two kinds of representation. The main drawback of Yu and Xu (2014) is that it ignores the clustering input by implicitly assuming that the clustering result and the clustering input have the same category representation. However, the input and the output may not have the same category representation, even for some clustering algorithms. Therefore, the categorization axioms cannot be applied directly to a general learning algorithm. In particular, the categorization axioms assume that the number of categories is greater than one, which is invalid for regression and manifold learning.

In order to generalize the categorization axioms to general categorization methods, we represent categorization problems by redefining the categorization input as $(X, U, \underline{X}, Sim_X)$ and the categorization result as $(Y, V, \underline{Y}, Sim_Y)$. Based on these proposed representations of the categorization input and the categorization result, the similarity (inner referring) operator and the assignment (outer referring) operator are defined. These two operators are helpful not only for presenting UCR but also for reinterpreting the categorization axioms. ECR, UCR, SS, CS and CE indeed delimit the theoretical constraints for categorization. In particular, UCR offers the theoretical constraints for a perfect categorization algorithm: it guarantees that what is expected to be learned is equivalent to what is actually learned, i.e. there is no gap between teaching and learning. More interestingly, if $(X, U, \underline{X}, Sim_X)$ and $(Y, V, \underline{Y}, Sim_Y)$ are taken as a conversation between two persons, CE states that the outer category representation is equivalent to the inner category representation with respect to categorization, which is consistent with the maxim of quality in conversation: do not say what you believe to be false (Grice, 1975). UCR states that the input and the output should refer to the same categorization, which is consistent with the maxim of relation in conversation: make your contribution relevant (Grice, 1975). When a dialogue can be carried out efficiently, UCR and CE should be true in daily life.

As in Yu and Xu (2014), a clustering result satisfying SS and CS cannot be guaranteed to be a good clustering result, as SS and CS are too weak. Similarly, when developing a categorization algorithm, SS and CS also need to be enhanced, which respectively results in the category compactness principle and the category separation principle under the newly proposed representation. In this paper, it is proposed that a categorization method should follow UCR in theory. However, UCR is too demanding for a categorization method. In many cases, UCR cannot hold and needs to be relaxed, which leads to one design principle of categorization methods: the categorization consistency principle. The relation between the proposed axioms and the design principles for categorization is shown in Figure 2.

Figure 2: Relationship between axioms and design principles for categorization

After the learning process, how to evaluate the catego¬ 
rization algorithm is very important. Categorization 
test axiom provides the prerequisite condition that the 
performance of the categorization algorithm can be 
evaluated and local categorization robustness assump¬ 
tion has offered the condition that the performance of 
the categorization algorithm can be guaranteed to be 
stable. 

When $c = 1$, ECR, SS, CS and CE trivially hold, but UCR offers the theoretical condition for categorization. When $c = 1$, categorization becomes some dimensionality reduction methods, density estimation, and regression. Some dimensionality reduction methods, density estimation and regression can be introduced by UCR or its approximated version (the categorization consistency principle), such as principal component analysis, nonnegative matrix factorization, canonical correlation analysis, locally linear embedding, multidimensional scaling, Isomap, parametric density estimation, nonparametric density estimation, linear regression, ridge regression and lasso. Theoretically, when $c = 1$, categorization mainly discusses how to represent a category, which lays a foundation for categorization with $c > 1$.

When U is not known a priori for c > 1, categorization
becomes clustering. ECR and UCR are always assumed to
hold for any clustering algorithm in order to simplify
the clustering process. Consequently, the clustering
result and the clustering input are interchangeable when
X = Y, so SS, CS and CE are sufficient for clustering in
that case. The theoretical analysis of clustering in
Yu and Xu (2014) therefore remains valid when X = Y.


When U is known a priori for c > 1, categorization
becomes classification. For classification, ECR and CE
always hold for a classification result, but SS and CS
hold only for a proper classification result, and UCR
holds only for a classification result with zero error.
Therefore, SS, CS and UCR are the more important
constraints for classification. With respect to a
classification result (Y, V, Y_, Sim_Y), the decision
region, the training decision region and the margin are
defined by SS. For categorization methods, the category
compactness principle can result in K-nearest neighbor
classification, linear discriminant analysis, support
vector machines, logistic regression, Bayesian
classification, minimum risk classification, maximum
expected utility classification, decision trees, etc.
The category separation principle can lead to support
vector machines and Fisher linear discriminant analysis.
The categorization consistency principle can lead to
empirical risk and structural risk, which can result in
neural networks and the binomial logistic regression
model.
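
The role of similarity in classification can be
illustrated with a minimal Python sketch. It relies only
on the informal reading stated in the abstract, namely
that an object should be assigned to its most similar
category. Everything else is an assumption made for
illustration and is not taken from the framework itself:
the similarity is taken to be the negative Euclidean
distance to a category mean, the helper names
(similarity, assign) are hypothetical, and the "gap"
between the two highest similarities is only a rough
stand-in for a margin, not the formal definition.

```python
import numpy as np

def similarity(points, prototypes):
    """Assumed similarity: negative Euclidean distance to each prototype."""
    diffs = points[:, None, :] - prototypes[None, :, :]
    return -np.linalg.norm(diffs, axis=2)  # shape (n_points, n_categories)

def assign(points, prototypes):
    """Assign each point to its most similar category.

    Also returns the gap between the highest and second-highest
    similarity, an illustrative (not formal) margin-like quantity.
    """
    sim = similarity(points, prototypes)
    labels = sim.argmax(axis=1)
    top_two = np.sort(sim, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]
    return labels, gap

# Toy training set with known labels u and c = 2 categories.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
u = np.array([0, 0, 1, 1])

# Assumed category representation: the mean of each category.
prototypes = np.stack([X[u == k].mean(axis=0) for k in range(2)])

labels, gap = assign(X, prototypes)

# True when every training object lands in its labeled category.
zero_error = bool((labels == u).all())
```

In this toy case every training object is most similar
to its own category, so the predicted labels agree with
u and the training error is zero. With overlapping
categories the agreement would generally fail, which is
the situation where a relaxed consistency criterion
becomes relevant.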


UCR, SS, CS and CE play different roles in different
categorization algorithms, but all have something to do
with similarity. It is well known that similarity plays
a key role in the human recognition system (Murphy, 2004;
Hahn, 2014). Furthermore, Kloos and Sloutsky (2008)
revealed that children represent categories based on
similarity and that similarity-based category
representation is a developmental default. The proposed
axiomatic framework indeed establishes a bridge between
cognitive science and machine learning through the
similarity (inner referring) operator.


Many questions remain to be addressed in the proposed
axiomatic framework in the future. For example, how
should an appropriate cognitive category representation
be designed for a specific categorization algorithm?
When c > 2, how can the similarity paradox be solved?
What conditions make the categorization robustness
assumption hold? When (X, U) is partially known or
noisy, what is the relation between categorization
axioms and categorization algorithms?